Problem Statement

Business Context

Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).

Objective

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning, and has collected sensor data on generator failures of wind turbines. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators can be repaired before failing/breaking, to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

A value of “1” in the target variable represents “failure” and “0” represents “no failure”.

Data Description

Importing necessary libraries
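A minimal import set that covers the steps this notebook walks through (data handling, plotting, model building, and evaluation). The exact libraries are an assumption based on the techniques used later, not taken from the source.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Splitting, imputation, models, and metrics used in later sections.
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
)
from sklearn.metrics import recall_score, confusion_matrix, make_scorer
```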

Loading the dataset
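In the actual project the data would be read from the provided train and test files (file names such as "Train.csv" are an assumption). To keep this sketch self-contained, it reads an equivalent in-memory CSV instead of a file on disk.

```python
import io
import pandas as pd

# Stand-in for pd.read_csv("Train.csv") / pd.read_csv("Test.csv").
csv_data = "V1,V2,Target\n0.5,1.2,0\n-0.3,0.8,1\n"
train = pd.read_csv(io.StringIO(csv_data))

print(train.shape)  # (2, 3)
```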

Data Overview

Exploratory Data Analysis (EDA)

Plotting histograms and boxplots for all the variables

Plotting all the features at one go
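One way to plot all features at one go is a grid with a histogram and a boxplot per column. This sketch uses synthetic data and plain matplotlib (the notebook may use seaborn instead; that is an assumption either way).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for the sensor features.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["V1", "V2", "V3"])

# One row per feature: histogram on the left, boxplot on the right.
fig, axes = plt.subplots(len(df.columns), 2, figsize=(10, 3 * len(df.columns)))
for i, col in enumerate(df.columns):
    axes[i, 0].hist(df[col], bins=30)
    axes[i, 0].set_title(f"{col} histogram")
    axes[i, 1].boxplot(df[col], vert=False)
    axes[i, 1].set_title(f"{col} boxplot")
fig.tight_layout()
```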

Data Pre-processing

Since we already have a separate test set, we don't need a three-way split; we only split the training data into train and validation sets.

Missing value imputation
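A common way to handle the missing sensor readings is scikit-learn's SimpleImputer. The choice of median as the strategy is an assumption here (it is robust to the outliers seen in the boxplots); the imputer must be fit on the training data only and then applied to validation/test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny example: two columns, each with one missing value.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)

# Column medians are 2.0 and 5.0, so the NaNs become those values.
print(X_imputed)
```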

Model Building

Model evaluation criterion

The nature of predictions made by the classification model will translate as follows:

True positives (TP) are failures correctly predicted by the model, leading to repair costs.
False negatives (FN) are real failures the model misses, leading to replacement costs.
False positives (FP) are false alarms on healthy generators, leading to inspection costs.

Which metric to optimize?

We want to maximize recall: a false negative (a missed failure) leads to the most expensive outcome, replacement, so catching as many true failures as possible matters most.

Let's define a function that outputs different metrics (including recall) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.
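The helpers described above could look like the sketch below (function names and the metric set are assumptions; the notebook's actual helpers may differ). A tiny perfectly separable dataset is included so the sketch runs end to end.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, recall_score, precision_score, f1_score, confusion_matrix,
)
from sklearn.tree import DecisionTreeClassifier

def model_performance(model, X, y):
    """Return accuracy, recall, precision, and F1 for a fitted classifier."""
    pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y, pred),
        "recall": recall_score(y, pred),
        "precision": precision_score(y, pred),
        "f1": f1_score(y, pred),
    }

def show_confusion_matrix(model, X, y):
    """Return the confusion matrix as [[TN, FP], [FN, TP]]."""
    return confusion_matrix(y, model.predict(X))

# Demonstration on perfectly separable data: all metrics should be 1.0.
X = np.array([[0.0], [1.0], [0.0], [1.0]])
y = np.array([0, 1, 0, 1])
model = DecisionTreeClassifier(random_state=1).fit(X, y)
perf = model_performance(model, X, y)
print(perf["recall"])  # 1.0
```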

Defining scorer to be used for cross-validation and hyperparameter tuning
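Since recall is the metric we care about, the scorer can be built with make_scorer and passed to cross-validation and tuning routines. The synthetic data below is only there to show the scorer in action.

```python
import numpy as np
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Scorer that cross_val_score / RandomizedSearchCV can optimize.
scorer = make_scorer(recall_score)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)  # target depends only on the first feature

scores = cross_val_score(
    DecisionTreeClassifier(random_state=1), X, y, scoring=scorer, cv=5
)
print(scores.mean())
```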

We are now done with pre-processing and evaluation criterion, so let's start building the model.

Model Building on original data

Model Building with Oversampled data
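Oversampling grows the minority ("failure") class until the classes are balanced. The project may well use SMOTE from imbalanced-learn; the dependency-free sketch below shows the same idea with random oversampling via sklearn's resample.

```python
import numpy as np
from sklearn.utils import resample

# Imbalanced toy data: 10 failures vs. 90 non-failures.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.array([1] * 10 + [0] * 90)

X_min, X_maj = X[y == 1], X[y == 0]
# Sample the minority class WITH replacement up to the majority size.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)

X_over = np.vstack([X_maj, X_min_up])
y_over = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_over))  # [90 90]
```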

Model Building with Undersampled data
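Undersampling goes the other way: it shrinks the majority class down to the minority size. A sketch with random undersampling (the notebook may instead use imbalanced-learn's RandomUnderSampler; this is an assumption):

```python
import numpy as np
from sklearn.utils import resample

# Imbalanced toy data: 10 failures vs. 90 non-failures.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = np.array([1] * 10 + [0] * 90)

X_min, X_maj = X[y == 1], X[y == 0]
# Sample the majority class WITHOUT replacement down to the minority size.
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=1)

X_under = np.vstack([X_maj_down, X_min])
y_under = np.array([0] * len(X_maj_down) + [1] * len(X_min))
print(np.bincount(y_under))  # [10 10]
```

The trade-off: undersampling trains faster but discards most of the majority-class information, while oversampling keeps all the data at the cost of duplicated minority rows.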

Hyperparameter Tuning

Sample Parameter Grids

Hyperparameter tuning can take a long time to run, so to keep the run time manageable you can use the following grids wherever required.

# Likely for GradientBoostingClassifier
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}

# Likely for AdaBoostClassifier (scikit-learn >= 1.2 names this parameter "estimator")
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Likely for BaggingClassifier
param_grid = {
    "max_samples": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": [30, 50, 70],
}

# Likely for RandomForestClassifier
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    # flattened so that each entry is a single valid candidate value
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}

# Likely for DecisionTreeClassifier
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

# Likely for LogisticRegression
param_grid = {"C": np.arange(0.1, 1.1, 0.1)}

# Likely for XGBClassifier
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
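Any of these grids plugs into a search object the same way. A runnable sketch using the decision-tree grid with RandomizedSearchCV on synthetic data (using randomized rather than exhaustive search, and a synthetic dataset, are assumptions made to keep the example fast):

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions=param_grid,
    n_iter=10,          # sample only part of the grid to keep run time low
    scoring="recall",   # optimize the metric we care about
    cv=3,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```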

Tuning AdaBoost using oversampled data

Tuning Random forest using undersampled data

Tuning Gradient Boosting using oversampled data

We have now tuned all the models; let's compare the performance of the tuned models and see which one is the best.

Model performance comparison and choosing the final model

The AdaBoost model tuned with oversampled data gives the best validation recall of 0.71, with no overfitting between the train and validation sets. Let's check the model's performance on the test set and then look at the feature importances.

Now we have our final model, so let's find out how our final model is performing on unseen test data.

Feature Importances
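Tree-based ensembles expose a feature_importances_ attribute that can be ranked and plotted. The sketch below uses a RandomForest on synthetic data where only one feature matters, so it should come out on top (the real notebook would use the final tuned AdaBoost model and the V1..V40 sensor columns).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["V1", "V2", "V3", "V4"])
y = (X["V1"] > 0).astype(int)  # only V1 actually drives the target

model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Rank the importances; a bar plot of this Series gives the usual chart.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.index[0])  # V1
```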

Let's use Pipelines to build the final model

Since all the predictors share a single data type (numeric), we don't need to use a column transformer here.
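A pipeline chains the imputer and the classifier into one object, so preprocessing is applied consistently at fit and predict time. AdaBoost is used below to mirror the chosen final model; the hyperparameter values are omitted here, so treat this as a structural sketch rather than the tuned model.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import AdaBoostClassifier

# Imputation + model in one estimator; no ColumnTransformer needed
# because every column is numeric.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("model", AdaBoostClassifier(random_state=1)),
])

# Synthetic data with some missing values to exercise the imputer.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[::10, 0] = np.nan
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)

pipe.fit(X, y)
print(pipe.predict(X[:5]).shape)  # (5,)
```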

Business Insights and Conclusions

The AdaBoost model gives us the most generalized scores, with little overfitting between the training and validation sets. Also, from the graph above, we see that the V36, V18, and V26 sensors are the most important features for predicting failures.